Investigating the Relationship between Word Segmentation Performance and Retrieval Performance in Chinese IR

Authors

  • Fuchun Peng
  • Xiangji Huang
  • Dale Schuurmans
  • Nick Cercone
Abstract

It is commonly believed that word segmentation accuracy is monotonically related to retrieval performance in Chinese information retrieval. In this paper we show that, for Chinese, the relationship between segmentation and retrieval performance is in fact nonmonotonic; that is, at around 70% word segmentation accuracy an over-segmentation phenomenon begins to occur which leads to a reduction in information retrieval performance. We demonstrate this effect by presenting an empirical investigation of information retrieval on Chinese TREC data, using a wide variety of word segmentation algorithms with word segmentation accuracies ranging from 44% to 95%. It appears that the main reason for the drop in retrieval performance is that correct compounds and collocations are preserved by accurate segmenters, while they are broken up by less accurate (but reasonable) segmenters, to a surprising advantage. This suggests that words themselves might be too broad a notion to conveniently capture the general semantic meaning of Chinese text.
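To make the notion of word segmentation accuracy concrete, the sketch below scores a predicted segmentation against a reference one using word-based precision, recall, and F1. This is only an illustration: the abstract does not say which accuracy measure the authors used, so the metric, the function names (`word_spans`, `segmentation_scores`), and the example sentence are all assumptions.

```python
def word_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_scores(gold, predicted):
    """Word-based precision, recall, and F1 for one segmented sentence.

    A predicted word counts as correct only if both of its boundaries
    match a word in the reference (gold) segmentation.
    """
    if "".join(gold) != "".join(predicted):
        raise ValueError("both segmentations must cover the same text")
    gold_spans = word_spans(gold)
    pred_spans = word_spans(predicted)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Hypothetical example: the reference keeps "检索" (retrieval) as one word,
# while the predicted segmentation breaks it into two single characters.
gold = ["信息", "检索", "系统"]
predicted = ["信息", "检", "索", "系统"]
precision, recall, f1 = segmentation_scores(gold, predicted)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

In this example the predicted segmentation scores lower only because it splits one compound. The abstract reports that splitting of this kind can nonetheless help retrieval, which is why such a score need not track retrieval performance.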

Similar resources

Finding the Better Indexing units for Chinese Information Retrieval

In the processing of Chinese documents and queries in information retrieval (IR), one has to identify the units that are used as indexes. Words and n-grams had been used as indexes in several previous studies, which showed that both kinds of indexes lead to comparable IR performances. In this study, we carried out more experiments to find the better way to index Chinese texts. First, we investi...

Evaluation via Negativa of Chinese Word Segmentation for Information Retrieval

Numerous studies have analyzed the influences of word segmentation (WS) performance on information retrieval (IR) for Mandarin Chinese and have demonstrated a non-monotonic relationship between WS accuracy and IR effectiveness. The usefulness of the compound words that have been a focus of the IR literature is not reflected by common WS evaluation metrics of word-based precision (P) and recall ...

Application of the Tightness Continuum Measure to Chinese Information Retrieval

Most word segmentation methods employed in Chinese Information Retrieval systems are based on a static dictionary or a model trained against a manually segmented corpus. These general segmentation approaches may not be optimal because they disregard information within semantic units. We propose a novel method for improving word-based Chinese IR, which performs segmentation according to the tigh...

English-Chinese Cross-Language IR Using Bilingual Dictionaries

This report describes the English-Chinese crosslanguage experiments at Berkeley for TREC-9 CrossLanguage Information Retrieval track. We present a simple and effective Chinese word segmentation method and compare the cross-language retrieval performance of two bilingual dictionaries for query translation.

Autobiographical Memory Performance and Problem Solving on the Life-Death Continuum: A Study of Depressed Patients

Introduction: This study aimed to determine the relationship between generality in retrieval from autobiographical memory in depressed patients and functional deficit in problem-solving strategies. Method: This survey analyzed the findings of several previous studies that investigated the subject of retrieval from autobiographical memory and the process of problem-solving among four g...

Publication date: 2002